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Abstract 

The design of an effective architecture for image retrieval requires careful consider- 
ation of the interplay between the three major components of a retrieval system: feature 
transformation, feature representation, and similarity function. We introduce a decision 
theoretic formulation of the retrieval problem that enables the design of systems where 
all components are optimized with respect to the same end-to-end performance criteria: 
the minimization of the probability of retrieval error. The new formulation is shown to 
have two appealing properties. First, it leads to an optimal similarity function (the pos- 
terior probability of the query under the database image class) that generalizes many of 
its previously proposed counterparts. Second, it enables a theoretical characterization 
of the impact of the feature transformation and representation in the probability of er- 
ror. In addition to exposing the major limitations of a large body of previous retrieval 
approaches, this characterization allows the derivation of a series of conditions for the 
optimal design of the feature transformation and representation. The search for a prac- 
tical solution that can satisfy these conditions leads to the adoption of an embedded 
multi-resolution mixture representation and originates an efficient algorithm for opti- 
mal feature selection. The resulting retrieval architecture achieves a good compromise 
between retrieval accuracy, invariance, perceptual relevance of similarity judgments, 
and complexity. Extensive experimental results show that decision-theoretic retrieval 
performs well on color, texture, and generic image databases in terms of both retrieval 
accuracy and perceptual relevance of similarity judgments. 
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1 Introduction 

An architecture for content-based image retrieval (CBIR) consists of three fundamental 
building blocks: 1) a feature transformation from the space of image observations (e.g. 
pixels) to a feature space with better retrieval properties, 2) a feature representation that 
compactly describes how each of the database image classes populates this space, and 
3) a similarity function that allows ranking the database classes by similarity to a query. 
While significant attention has been devoted to each of these individual components, 
there have been significantly fewer attempts to investigate the interrelationships among 
them and how these relationships affect the overall performance of retrieval systems. 

In fact, a significant fraction of the retrieval literature can be classified into two 
major groups, according to the emphasis placed on the design of the individual retrieval 
components. The first group contains solutions tailored for texture (to which we refer 
as texture-based retrieval) while the second contains solutions tailored for color (color- 
based retrieval). Texture retrieval approaches tend to place all emphasis on the design 
of the feature transformation. The key idea, which can be explicit in the formulation of 
the problem [71, 75, 17] or only implicit [40, 45, 44, 46], is to find discriminant feature 
transformations. These are transformations that best separate the feature distributions 
of the different image classes. Ideally, given small class overlap, simple similarity 
metrics such as the Euclidean or the Mahalanobis distance (MD) should guarantee 
good retrieval performance. 

On the other hand, discrimination has not been a critical issue for color-based re- 
trieval, where the features are either the pixel colors themselves or color-ratios that 
guarantee different types of invariance [23, 26]. Instead significant work has been 
devoted to the feature representation, consisting mostly of variations on the color his- 
togram [68], e.g. the color coherence vector [52], the color correlogram [28], color 
moments [66], etc. Here, similarity metrics are usually L p norms and, among these, 
the L 1 distance, also known as histogram intersection (HI) [68], has become quite 
popular [68, 56, 58, 41, 59, 64, 66, 1]. 

While they have worked reasonably well in their specific domains, these represen- 
tations break down when applied to generic databases. On one hand, the discrimi- 
nant transformations proposed by texture-based approaches tend to be database spe- 
cific, e.g. discriminant features for a texture database are usually not discriminant for 
an object database, and it is therefore not clear that such approaches can be general- 
ized to the full-blown retrieval problem (where image content is unconstrained). On 
the other, color-based solutions are plagued by the exponential complexity of the his- 
togram on the dimension of the feature space, and are therefore only applicable to 
low-dimensional feature spaces (e.g. the space of pixel colors). Hence, they are unable 
to capture the spatial dependencies that are crucial for characterizing image properties 
such as texture or local surface appearance. 

The alternative to concentrating on the features or feature representation, is to in- 
vestigate the design of retrieval systems that are optimal in some end-to-end sense, i.e. 
where all retrieval components are optimized with respect to the same overall perfor- 
mance criteria. This, of course, raises the problem of defining a meaningful criteria 
for end-to-end optimality. Since the ultimate goal of any retrieval system is to be cor- 
rect as often as possible, we formalize the retrieval problem as one of decision theory 
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and adopt the criteria of minimizing probability of retrieval error (MPE). The decision- 
theoretic formulation has two major properties of interest. First, it leads to generic 
solutions which are optimal in a sense (MPE) that is meaningful for any type of visual 
databases, e.g. object databases, texture databases, databases of consumer photographs, 
and so forth. Second, it makes a vast body of existing decision-theoretic results relevant 
to the retrieval problem, simplifying the task of designing optimal systems. 

One well known such result is that the optimal similarity function, in the MPE 
sense, is that associated with the Bayes classifier: the posterior probability, under each 
database class, of the features vectors in the query. In this work, Bayesian similarity 
is 1) shown to generalize many of the similarity functions (Mahalanobis distance, x 2 
statistic, and minimum discrimination information, among others) in common use in 
the retrieval literature, and 2) used as a starting point for a decision-theoretic analysis 
of the trade-offs to be satisfied by the retrieval components, when the goal is to achieve 
end-to-end optimality. The main result of this analysis is that any retrieval system must 
indeed achieve a compromise between feature transformation and feature representa- 
tion, taking into account three conflicting constraints: 

• fine image discrimination requires the ability to capture local dependencies be- 
tween image pixels, which can only be achieved through spatially supported 
features, i.e. when the space of image observations is high-dimensional. 

• (MPE) optimal performance is only guaranteed for a restricted set of invertible 
feature transformations; 

• density estimates tend to be poor when the feature space is high-dimensional. 

Because an invertible transformation can only map a high dimensional observation 
space into a high-dimensional feature space, where it is difficult to obtain reliable den- 
sity estimates, it follows that the design of decision-theoretic retrieval systems always 
requires either sacrificing the invertability of the transformation (allowing the feature 
space to be low-dimensional even when the observation space is not), or sacrificing the 
spatial support of the features (by relying on low-dimensional observation spaces). 

Since either of these can have drastic consequences on retrieval accuracy, it is im- 
portant to base design decisions on a solid understanding of all the involved trade-offs. 
To obtain such understanding we introduce the notion of embedded feature spaces, 
which are the spaces obtained by sequential downward projection of a starting feature 
space. Embedded feature spaces are shown to be an intrinsic component of retrieval 
systems with linear feature transformations, in the sense that any such transformation 
originates a sequence of embedded spaces with monotonically decreasing lower bound 
in probability of error and monotonically increasing density estimation error. In re- 
sult, for a given feature transformation, the probability of error is a convex function 
of the number of embedded subspaces considered in the retrieval operation. It follows 
that the problem of optimal feature design can be decoupled into two smaller subprob- 
lems: 1) finding the best invertible feature transformation, and 2) finding the subspace 
dimension where the probability of error achieves its minimum value. 

In general, these are difficult problems which involve iterating between density es- 
timation and feature updating, two steps that must cycle through all image classes in 
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the database. We show, however, that efficient solutions are possible whenever the 
set of transformations of interest is finite, the search restricted to sequences of em- 
bedded subspaces of a common transformation, and the Gauss mixture adopted for 
feature representation. The latter is a particularly interesting result due to the fact that 
Gauss mixtures exhibit three other properties that are appealing in the retrieval context: 
computational tractability in high-dimensional spaces, ability to approximate arbitrary 
densities, and compactness. Overall, this leads to the notion of a sequence of embed- 
ded mixture models. Given a mixture density defined on a starting feature space, this 
is simply the sequence of mixtures resulting from the projection into the associated 
embedded subspaces. 

Once the Bayesian similarity criteria and the embedded mixture representation are 
in place, it remains to determine the best finite set of feature transformations to con- 
sider during feature design. Here we simply draw on what is known about the human 
visual system and consider the set of multi-resolution transforms. This leads to the no- 
tion of embedded multi-resolution mixtures (EMM), which are families of embedded 
densities ranging over multiple image scales. EMMs are shown to generalize color his- 
tograms, complementing them with the ability to capture spatial image dependencies 
and allowing fine control over the invariance properties of the overall image represen- 
tation. We present a cross-validation algorithm for finding the best multi-resolution 
decomposition, and determining the associated optimal subspace dimension, that is 
computationally efficient and has good retrieval performance. 

Overall, the retrieval architecture composed by the Bayesian similarity criteria, 
a multi-resolution feature transformation, and the embedded mixture representation 
achieves a good compromise between retrieval accuracy, invariance, perceptual rel- 
evance of similarity judgments, and complexity. We illustrate these properties with 
an extensive experimental evaluation on three different databases that stress different 
aspects of the retrieval problem: the Brodatz texture database, the Columbia object 
database, and the Corel database of stock photography. In all cases, the new approach 
outperforms solutions representative of the state-of-the-art both in terms of objective 
(precision/recall) and subjective (perceptual) evaluation. 

The paper is organized as follows. Section 2 establishes some notation to be used 
in the remaining sections. Section 3 introduces the decision-theoretic retrieval for- 
mulation, reviews some known results, and establishes the relationships between the 
Bayesian similarity criteria and various other similarity functions in common use in 
the literature. A theoretic characterization of the impact of the feature transformation 
and representation on the overall retrieval performance is then carried out on section 4. 
This characterization is also used to expose the major limitations of the color-based 
and texture-based retrieval strategies. Section 5 introduces embedded feature spaces 
and translates the theoretical results of Section 4 into a series of conditions for the 
design of the optimal feature transformation and representation. The embedded multi- 
resolution mixture representation is then introduced in section 6, which also provides 
an algorithm for optimal feature design. Finally, an experimental evaluation of the 
various aspects of decision-theoretic retrieval is presented in section 7. 
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2 Terms and notation 

We start by introducing some notation. The basic element of image representation is 
an image observation. This can be a single pixel or a number d of them located in a 
pre-defined spatial neighborhood. We denote the space of observations by Z C M d . 
The scalar d is always used to denote the dimension of the space Z. Observations are 
mapped into feature vectors by a linear transformation 

T : Z -> X. (1) 

We refer to X as the feature space, and x = T(z) a feature vector. Features are the 
elements of a feature vector. The matrix that defines the transformation T is denoted 
byT. 

A feature representation is a probabilistic model for how each of the image classes 
in the database populates the feature space X. We introduce a class indicator vari- 
able Y € {1, . . . , M} and denote the probability density function (pdf) of class i by 
Px|y (X = x|Y = i). This also illustrates the following conventions for notation: ran- 
dom variables are represented in upper-case while values appear in lower-case, vectors 
are represented in boldface while scalars appear in normal type. Whenever the mean- 
ing is clear from context we replace the expression above by the simpler Px|y(x|i). 
Finally, we frequently write X € A 1 to indicate that the random variable X takes values 
in X. 

Throughout this work we assume that feature vectors are independent and identi- 
cally distributed (iid), e.g. 

JV 

Px 1 ,...,x jv (xi ! ... ! xjv) = JJP x (x j ). 

J=l 

One distribution that we will encounter frequently is the Gaussian, of mean fi and 
covariance S, 

Px(x)=e(x ! p !S ) = - 7 =l= e -^^lll (2) 

where 

||x-A*||e = -^(x-^S-^x-/*) (3) 

is the quadratic norm defined by The Euclidean norm is the particular case in 

which S = I. Another model that we will frequently refer to is the histogram. The 
histogram of a collection of feature vectors S is a vector f = {/i , . . . , /r} associated 
with a partition of the feature space X into R regions {Ci, . . . , Cr}, where f r is the 
number of vectors in S landing on cell C r . 

3 Decision-theoretic image similarity 

In the CBIR context, image similarity can be formulated as a problem of statistical 
classification. Given the feature space X, a retrieval system is simply a map 

g:X^{l,...,M} 



3.1 A unified view of image similarity 
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from X to the index set of the M classes in the database. In this sense, it is natural 
to adopt a decision-theoretic formulation of the retrieval problem, where the goal is to 
design systems that have minimum probability of retrieval error, i.e. that are wrong as 
rarely as possible. 

Definition 1 A minimum probability of error (MPE) retrieval system is the mapping 

g:X^{l,...,M} 

that minimizes 

Px,y( 9 (X)^F). 

Under this definition, the optimal similarity function is well known [15]. 

Theorem 1 Given a feature space X and a query x, the similarity function that min- 
imizes the probability of retrieval error is the Bayes or maximum a posteriori (MAP) 
classifier 

5*(x) = argmaxPy|x(«|x). (4) 

i 

Furthermore, the probability of error is lower bounded by the Bayes error 

L^ = l-£ x [maxP y | X (i|x)] ! (5) 

where E x means expectation with respect to Px(x). 

Proof: See appendix A.l. 

One way to implement the MAP classifier is with recourse to Bayes rule 

JV 

g*(x) = argmax]J P X |y(x j |i)P y (y = i) 

JV 

= arg max ^ log P X |y (x^ \i) + log P Y (i) , (6) 

where we have used the iid assumption for X. Equation (6) is denoted by Bayesian 
retrieval criteria and image retrieval based on it as decision-theoretic retrieval (DTR). 

3.1 A unified view of image similarity 

In this section, we analyze the relationships between the Bayesian retrieval criteria 
and a significant number of previously proposed similarity functions. The goal is to 
show that many of the latter can be derived from the decision-theoretic principles at the 
core of the former, by making various assumptions or approximations. This not only 
demonstrates that, in general, these alternatives cannot lead to superior performance 
but also enables a principled understanding of their limitations and applicability to 
different retrieval contexts. 
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The assumptions/approximations required to derive several popular similarity func- 
tions from the Bayesian criteria are depicted in Figure 1. If an upper bound on the 
Bayes error of a collection of two-way classification problems is minimized instead 
of the probability of error of the original problem, the Bayesian criteria reduces to the 
Bhattacharyya distance (BD). On the other hand, if the original criteria is minimized, 
but the different image classes are assumed to be equally likely a priori, we have the 
maximum likelihood (ML) retrieval criteria. As the number of query vectors grows to 
infinity the ML criteria tends to the minimum discrimination information (MDI), which 
in turn can be approximated by the \ 2 test by performing a simple first order Taylor 
series expansion. Alternatively, MDI can be simplified by assuming that the underlying 
probability densities belong to a pre-defined family. For auto-regressive sources it re- 
duces to the Itakura-Saito distance that has received significant attention in the speech 
literature. In the Gaussian case, further assumption of orthonormal covariance matri- 
ces leads to the quadratic distance (QD) frequently found in the compression literature. 
The next possible simplification is to assume that all classes share the same covariance 
matrix, leading to the MD. Finally, assuming identity covariances results in the square 
of the Euclidean distance (ED). We next derive in more detail all these relationships. 
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Figure 1 : Relations between different image similarity functions. 



3.1 A unified view of image similarity 



1 



3.1.1 Bhattacharyya distance 

If there are only two classes in the classification problem, (5) can be written as [22] 
L* x = £ x [min(P y | X (0|x),P y |x(l|x))] 

= j P x (x)min[P y |x(0|x),P y | X (l|x)]dx 

= j min[P X |y(x|0)Py(0) ! Px|y(x|l)P y (l)]dx 

< VPy(0)P y (1) J y / P X |y(x|0)P X |y(x|l)dx ! 

where we have used the bound min[o, b] < Vab- The last integral is usually known as 
the Bhattacharyya distance between Px|y (x|0) and Px|y (x| 1) and has been proposed 
(e.g. [47, 1 1]) for image retrieval where, for a query density Px(x), it takes the form 

#(x)=argmin J y^P x (x)P X |y (x|i)dx. (7) 

The resulting classifier can thus be seen as the one which finds the lowest upper-bound 
on the Bayes error for the collection of two-class problems involving the query and 
each of the database classes. 

Whenever it is possible to solve the minimization of the error probability on the 
multi-class retrieval problem it makes small sense to replace it by the search for the two 
class problem with the smallest error bound. Consequently, the above interpretation of 
the BD makes it clear that, in general, there is small justification to prefer it to DTR. 



3.1.2 Maximum likelihood 

It is a straightforward consequence of (6) that, when all image classes are a priori 
equally likely, Py(i) = 1/M, 

1 N 

g(x) = argmax — ^logP X |y(xj|i). (8) 

This decision rule is known as the maximum likelihood classifier. While class priors 
Py(i) can provide a useful mechanism to 1) account for the context in which the re- 
trieval operation takes place, 2) integrate information from multiple content modalities 
that may be available in the database, and 3) design learning algorithms [81, 79], in 
this work we assume that there is no a priori reason to prefer any given image over the 
rest. In this case, Bayesian and maximum likelihood retrieval are equivalent. 
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3.1.3 Minimum discrimination information 

If Hi, i = 1,2, are the hypotheses that x is drawn from the statistical population with 
density P;(x), the Kullback-Leibler divergence (KLD) or relative entropy [33, 13] 

KL[P 2 (x)||Pi(x)] = J P 2 (x) log ^|dx (9) 

measures the mean information per observation from P 2 (x) for discrimination in favor 
of Hi against Hi . Because it measures the difficulty of discriminating between the 
two populations, and is always non-negative and equal to zero only when Pi(x) = 
P 2 (x) [33], the KLD has been proposed as a measure of similarity for various com- 
pression and signal processing problems [27, 36, 18, 10]. 

Given a density Pi (x) and a family of densities M. the MDI criteria [33] seeks the 
density in M that is the "nearest neighbor" of Pi (x) in the KLD sense 

P 2 *(x) = arg min KL[P 2 (x)||Pi (x)]. 

If M is a large family, containing Pi (x), this problem has the trivial solution P 2 (x) = 
Pi (x), which is not always the most interesting. In other cases, a sample from P 2 (x) 
is available but the explicit form of the distribution is not known. In these situations 
it may be more useful to seek for the distribution that minimizes the KLD subject to a 
stricter set of constraints. Kullback suggested the problem 

P*(x)=arg min KL[P 2 (x)||Pi(x)] 

P 2 (x)£M 

subject to 

J T(x)P 2 (x)=0 

where T(x) is a measurable statistic (e.g. the mean when T(x) = x) and 8 can be 
computed from a sample (e.g. the sample mean). He showed that the minimum is 1) 
achieved by 

P 2 '(x) = |e- AT WPi(x) 

where Z is a normalizing constant, Z = J e~ XT ^Pi (x)dx, and A a Lagrange multi- 
plier that weighs the importance of the constraint; and 2) equal to 

KL[P'(x)\\P 1 (x)] = -Xff-\ogZ. 

Gray and his colleagues have studied extensively the case in which Pi (x) belongs to 
the family of auto-regressive moving average (ARMA) processes [27, 19] and showed, 
among other things, that in this case the optimal solution is a variation of the Itakura- 
Saito distance commonly used in speech analysis and compression. Kupperman [34] 
has shown that when all densities are members of the exponential family, the con- 
strained version of MDI is equivalent to maximum likelihood. 



3.1 A unified view of image similarity 
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The KLD has only been recently considered in the retrieval literature [78, 31, 56, 
7, 16], where attention has focused on the unconstrained MDI problem 

g(x) = argmin KL[P x (x)||P X |y(x|i)], (10) 

where Px(x) is the density of the query and P X |y(x|i) that of the i th image class. 
Similarly to the constrained case, it is possible to derive a connection between uncon- 
strained MDI and maximum likelihood. However, the connection is much stronger 
in the unconstrained case since there is no need to make any assumptions regarding 
the type of densities involved. In particular, by simple application of the law of large 
numbers to (8), 

g(x) = argmax£^ x [logPxiy(x|i)l as N — > oo 
= argmax J Px(x) log Px|y (x|i)dx 

= argmin j Px(x) log P x (x)dx - j Px(x) log P X |y (x|i)dx 

= argmin / P x (x)log J > *^\.. dx. 
* J P X |y(x|«) 

= axgmmKL[P x {x)\\Px lY {x\i)}, 

where E x is the expectation with respect to the query density Px(x). This means that, 
independently of the type of densities, MDI is simply the asymptotic limit of the ML 
criteria as the cardinality of the query grows to infinity. This relationship is important 
for various reasons. First, it confirms that the Bayesian criteria converges to a meaning- 
ful similarity function between image densities as the cardinality of the query grows. 
Second, it makes it clear that while ML and MDI perform equally well for image-based 
queries, the Bayesian criteria has the added advantage of also enabling queries based 
on image regions. Finally, it establishes a connection between the Bayesian criteria and 
several similarity functions that can be derived from MDI. 



3.1.4 x 2 test 



The first of such similarity functions is the x 2 statistic. Using a first order Tayli 
approximation for the logarithmic function about x = 1, log(x) m x — 1, we c 

KL[P 1 (x)||P 2 (x)] = |p l( x)log^|dx 

^ /• P 1 (x) 2 -P 1 (x)P 2 (x) 



lor series 
obtain 
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= /( fi(x)2 "p^r aW - p ' w+F ' w ) A 



/ 



(P!(X)-P 2 (X)) 2 

P 2 (x) 



where we have used the fact that J Pj(x)dx = 1, i = 1, 2. In the retrieval context, this 
means that MDI can be approximated by 

g(x) wargmin / dx. (11) 

The integral on the right is known as the x 2 statistic and the resulting criteria a % 2 
test [51]. It has been proposed as a metric for image similarity in [61, 7, 56, 35], 
among others. Since it results from the linearization of the KLD, it can be seen as an 
approximation to the asymptotic limit of the ML criteria. Obviously, this linearization 
can discard a significant amount of information and there is, in general, no reason to 
believe that it should perform better than DTR. 

3.1.5 The Gaussian case 

Several similarity functions of practical interest can be derived from the Bayesian cri- 
teria when the class likelihood functions are Gaussian. In this case, (8) becomes 

g(x) = argminloglSil + ^(x„ - ft) T S j " 1 (x„ - m) 

n 

= argminlog |E,| + (12) 

i 

where 

N SZ( X " ~ A*i) Ts , rl ( x " _ A*i) 

n 

is the quadratic distance (QD) commonly found in the perceptually weighted com- 
pression literature [24, 38]. As a retrieval metric, the QD can thus be seen as the result 
of imposing two stringent restrictions on the generic ML criteria. First, that all im- 
age sources are Gaussian and, second, that their covariance matrices are orthonormal 
= l,Vi). Furthermore, because 



n 

= ^X]( x " _ x + x _ ^) TS i~ 1 ( x « -x + x-^) 

n 

= ^-tracelSr 1 £( x „ - x)(x„ - x) T ] + (x - A i i ) T Sr 1 (x - m) T 
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= trace[Ei 1 t x ] + Mi 



(13) 



where x = 1 /N J2 n x « anc ^ ^ x = 1^ S n ( x « — x ) ( x « — X ) T are ' respectively, the 
sample mean and covariance of x„ and 



the Mahalanobis distance, we see that the MD results from complementing Gaussianity 
with the assumption that all classes have the same covariance (S x = Sj = S, Vi). 

Finally, if this covariance is the identity (S = I), we obtain the square of the 
Euclidean distance (ED) or mean squared error 



The MD, the ED, and variations on both, have been widely used in the retrieval litera- 
ture [64, 41, 1, 65, 49, 62, 54, 44, 53, 6, 56, 70, 59, 3]. 

3.1.6 Some intuition for the advantages of DTR 

The analysis of the Gaussian case emphasizes why there is little justification to prefer 
any of the above three similarity metrics to the Bayesian criteria. Recall that while for 
the latter the similarity function is 



all other three are approximations that arbitrarily discard covariance information. 

As shown in Figure 2, this information is important for the detection of subtle vari- 
ations such as rotation and scaling in feature space. In a) and b), we show the distance, 
under both QD and MD between a Gaussian and a replica rotated by 8 € [0, n]. Plot b) 
clearly illustrates that while the MD has no ability to distinguish between the rotated 
Gaussians, the inclusion of the irace[E j ~ 1 E x ] term leads to a much more intuitive 
measure of similarity: minimum when both Gaussians are aligned and maximum when 
they are rotated byn/2. 

As illustrated by c) and d), further inclusion of the term log (full ML retrieval) 
penalizes mismatches in scaling. In plot c), we show two Gaussians, with covariances 
S x = I and Sj = a 2 1, centered on zero. In this example, MD is always zero, while 
trace[S j ^ 1 S x ] oc 1/ct 2 penalizes small a and log |Sj | oc log a 2 penalizes large a. The 
total distance is shown as a function of log a 2 in plot d) where, once again, we observe 
an intuitive behavior: the penalty is minimal when both Gaussians have the same scale 
(log o~ 2 = 0), increasing monotonically with the amount of scale mismatch. Notice that 
if the log term is not included, large changes in scale may not be penalized at all. 

3.1.7 L p norms 



£i = (x- /ij) T (x- Hi). 



(14) 



QD 




(15) 



Despite all its good properties, the Bayesian retrieval criteria has received small atten- 
tion in the context of CBIR. An overwhelmingly more popular similarity function is 
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Figure 2: a) A Gaussian with mean (0, 0) T and covariance diag(4, 0.25) and its replica 
rotated by 6. b) Distance between the Gaussian and its rotated replica as a function 
of 8/ir under both the QD and the MD. c) Two Gaussians with different scales, d) 
Distance between them as a function of log a 2 under ML, QD, and MD. 
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the L p norm of the difference between densities 



<?(X) = argmin 




|Px(x)-P X |y(x|i)| p dx 



P 



(16) 



These norms are particularly common in the color-based retrieval literature as similar- 
ity metrics for color histograms. Defining q to be the histogram of Q query vectors, 
and p* the histogram of P l vectors from the i th image class, (16) reduces to 



As shown in [68], when the histograms are normalized (J2 r Qr/Q = Z) r Pr/-f" = 
1, Vi), the minimization of the L 1 distance is equivalent to the maximization of the HI 



While (8) minimizes the classification error, (16) implies that minimizing pointwise 
similarity between density estimates should be the ultimate retrieval criteria. Clearly, 
for any of the two criteria to work, it is necessary that the estimates be close to the 
true densities. However, it is known (e.g. see Theorem 6.5 of [15]) that the probability 
of error of rules of the type of (8) tends to the Bayes error orders of magnitude faster 
than the associated density estimates tend to the right distributions. This implies that 
accurate density estimates are not required everywhere for the classification criteria to 
work. 

In fact, accuracy is required only in the regions near the boundaries between the 
different classes, because these are the only regions that matter for the classification 
decisions. On the other hand, the criteria of (16) is clearly dependent on the quality of 
the density estimates all over X. It, therefore, places a much more stringent require- 
ment on the quality of these estimates and, since density estimation is know to be a 
difficult problem [76, 63], it is unlikely that it will perform better than (8). This is 
indeed confirmed by the experimental results presented in section 7. 

4 Decision-theoretic guidelines for image representation 

One of the interesting properties of the DTR formulation is that it enables the de- 
sign of systems where all components are optimized with respect to a common criteria 
(probability of retrieval error). We next analyze how the feature transformation and 
representation impact the overall system optimality. 




g(X) = argmax 



Q 



(17) 



4.1 Feature transformation 

We start by analyzing the role of the feature transformation. 
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Theorem 2 Given a retrieval system with observation space Z and a feature transfor- 
mation 

T:Z^X, 

then 

L* x > L* z (18) 

where L* z and L* x are, respectively, the Bayes errors on Z and X. Furthermore, 
equality is achieved if and only ifT is an invertible transformation. 

Proof: see appendix A.2. 

The last statement of the theorem is a worst-case result. In fact, for a specific 
retrieval problem, it may be possible to find non-invertible feature transformations that 
do not increase Bayes error. What is not possible is to find 1) a feature transformation 
that will reduce the Bayes error, or 2) a universal non-invertible feature transformation 
guaranteed not to increase the Bayes error on all retrieval problems. 

4.2 Feature representation 

While a necessary condition, low Bayes error is not sufficient for accurate retrieval 
since the actual error may be much larger than the lower bound. 

Theorem 3 Given a retrieval system with a feature space X, unknown class probabil- 
ities Py{i), class densities Px|y(x|i), and a decision function 

g(x) = argmaxp X |y(x|i)py(i), (19) 
the actual probability of error is upper bounded by 

P(g(X) ^Y)<L* X + Y,I \Px\Y(x\i)P Y (i) - p X |y(x|i)py(i)|dx. (20) 



Proof: see appendix A.3. 

In the remainder of this work we assume that the classes are a-priori equiprobable, 
i.e. -Py(i) = 1/M, Vi. This leads to the following corollary. 

Corollary 1 Given a retrieval problem with equiprobable classes, a feature space X, 
unknown class conditional likelihood functions P X |y (x|i), and a decision function 

g(x) = argmaxpx|y(x|i), (21) 

the difference between the actual and Bayes error is upper bounded by 

P(g(X) ^Y)-L* x < A g , x (22) 

where 

A g<x = ^lfL[i^| y (x|0||px|y(x|0], (23) 

i 

is the estimation error. 



4.3 Strategies for image representation 
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Proof: see Appendix A.4. 

In summary, the difference between the actual probability of retrieval error and 
the Bayes error is upper bounded by the error in the density estimates. This implies 
that, if the Bayes error is small, accurate density estimation is a sufficient condition for 
high retrieval accuracy. In particular, good density estimation will suffice to guarantee 
optimal performance when the feature transformation is invertible. 

4.3 Strategies for image representation 

The two theorems are convenient tools for analyzing the balance between feature trans- 
formation and representation achieved by any retrieval strategy. We now proceed to do 
so for two strategies in widespread use in the literature. 

4.3.1 The color strategy 

The theorems suggest that all that really matters for accurate retrieval is good density 
estimation. Since no feature transformation can reduce the Bayes error, there seems 
to be no advantage in using one. This is the rationale behind Strategy 1 (SI): avoid 
feature transformations altogether and do all the estimation directly in Z. As Figure 3 
illustrates, the main problem with this strategy is that density estimation can be difficult 
in Z. Significant emphasis must therefore be given to the feature representation which 
is required to rely on a sophisticated density model. One possible solution, that has 
become a de-facto standard for color-based retrieval [68, 56, 58, 41, 59, 64, 66, 1], is 
the histogram. This solution is illustrated in Figure 3 b). 




a) b) 

Figure 3: Example of a retrieval problem with four image classes, a) In the space of 
image observations, the class densities can have complicated shapes, b) Strategy 1 is 
to simply model the class densities as accurately as possible. 

While they work reasonably well when Z is a low-dimensional space, e.g. the 3-D 
space of pixel colors, histograms are of very limited use in high dimensions. This is 
a consequence of the exponential growth of the number of histogram cells with the 
dimension of the space. Since this dimension is proportional to the size of the region 
of support of the observations, accurate histogram-based density estimates can only 
be obtained for very small spatial neighborhoods. Consequently, the representation 
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cannot capture the spatial dependencies that are crucial for fine image discrimination. 
This is illustrated by Figure 4. 




CAN 



YOU 



Figure 4: Two images that, although visually very dissimilar, have the same color 
histogram. 



4.3.2 The texture strategy 

Because accurate density estimation is usually a difficult problem, a feature transfor- 
mation can be helpful if it makes estimation significantly easier in X than what it is 
in Z. The rationale behind Strategy 2 (S2) is to exploit this as much as possible: find 
a feature transformation that clearly separates the image classes in X, rendering esti- 
mation trivial. Ideally, in X, each class should be characterized by a simple parametric 
density, such as the Gaussians in Figure 5, and a simple classifier should be able to 
guarantee performance close to the Bayes error. 





X 



Figure 5: Example retrieval problem with four image classes. Strategy 2 is to find a 
feature transformation such that density estimation is much easier in X than in Z. 



Strategy S2 has become prevalent in the texture literature, where numerous feature 
transformations have been proposed to achieve good discrimination between different 
texture classes [64, 41, 54, 44, 55, 17, 45, 73, 57, 69, 9, 71]. These transformations are 
then combined with simple similarity functions, like the Mahalanobis and Euclidean 
distances or variations of these, that assume Gaussianity in X. More recently it has 
also been embraced by many retrieval systems [6, 49, 70, 59, 56, 64, 41, 53, 3]. 

The main problem of strategy S2 is the assumption that it is always possible to find 
a transformation that maps a collection of complicated densities in Z into a collection 
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of Gaussians in X, without compromising Bayes error. This is usually not possible and, 
for acceptable levels of Bayes error, feature densities tend to have non-trivial shapes, 
e.g. are multi-modal. Assuming a Gaussian model on X can therefore lead to poor 
density estimates and significant penalties in retrieval precision. 

5 Optimal retrieval systems 

The two standard strategies can be seen as two ends of a continuum: while S 1 is intran- 
sigent with respect to any loss in Bayes error and therefore asks too much of the feature 
representation; S2 constrains the representation to trivial models, expecting the feature 
transformation to do the impossible. It seems that a wiser position would be to stand 
somewhere in between. Since the overall probability of error is upper bounded by the 
sum of the Bayes and estimation errors, we need to consider the two simultaneously. 

5.1 Optimal feature representation 

With respect to the feature representation, Theorem 3 shows that (whatever the feature 
transformation may be) there are no guarantees of small probability of error if the 
density estimates are inaccurate. The quality of these estimates is determined by two 
factors: the choice of probabilistic model, or feature representation, and the estimation 
of the parameters of this model. 

5.1.1 Parameter estimation 

The following lemma shows that, for a given parametric density family, the optimal 
parameters are obtained by standard maximum likelihood estimation. 

Lemma 1 Consider a retrieval problem with equiprobable classes, a feature space X, 
a decision function 

g(x;p) = argmaxP X |y(x|i;Pi), (24) 

where Px|y ( x |i; Pi) is a pdf parameterized by -pi, and a sequence of samples {xj,i £ 
1, . . . , M}, where contains N iid feature vectors from the i th image class in the 
database. Then the upper bound on the density estimation error 

A g , x = ^KL[P X |y(x|i)||Px|y(x|i; Pi )], (25) 

i 

is minimized if 

1 N 

Pi = argmax— VlogP x iy(x iifc |i;p). (26) 
p iV i ~ l 

k=l 

Proof: see Appendix A.5 

The main difficulty posed by feature representation is, therefore, to determine what 
is the best parametric family for density estimation. 
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5.1.2 Parametric density families 

We have seen in the previous section that the Gaussian and histogram models can im- 
pose strong limitations on retrieval accuracy. There are, however, several more so- 
phisticated density models including vector quantizers [25], decision-trees [8], mix- 
tures [72], and kernel-based representations [63]. While all of the latter overcome the 
main limitations of the former, they introduce some problems of their own. 

For example, kernel-based density estimates do not provide a compact description 
of the underlying density (their complexity is proportional to the number of feature 
vectors in the training set) and lead to a similarity function (8) that is too complex for 
most retrieval applications. On the other hand, vector quantizers and decision-trees 
assume a partition of the feature space into mutually exclusive cells that can originate 
significant fluctuations of the density estimates in the presence of small variations of 
the true density [80]. In fact, these representations can be seen as generalizations of 
the histogram that, while overcoming the problem of exponential complexity in the 
dimension of the space, still exhibit all the limitations associated with a partition of 
the feature space into non-overlapping cells. Such limitations are avoided by mixture 
models. 

Definition 2 A mixture density is a density of the form 

C 

Px(x) = £ P xlw (x\w)P w (w), (27) 

w=l 

where {P x \ w (x\w)}C =1 is a sequence of mixture components. 

Mixture models are particularly well suited for the retrieval problem due to four 
main properties. First, because the mixture inherits the complexity of its components, 
it is tractable in high dimensions whenever the components are. In the Gaussian case, 
complexity is only quadratic in the dimension of the space (linear for Gaussians of 
diagonal covariance). Mixtures are therefore significantly more tractable than his- 
tograms. Second, like histograms, mixtures can approximate arbitrary densities. In 
fact, because they rely on smoother kernels, approximations based on mixtures can be 
significantly better than those possible with histograms, vector quantizers, or decision 
trees [63, 39, 37]. Third, as is clear from (27), the complexity of a mixture is linear 
in the number of components C, which is usually small. Hence, unlike kernel-based 
methods, mixtures provide a compact representation of the underlying density. 

In this sense, mixtures combine the good properties of the Gaussian, histogram, 
and kernel-based models: computational tractability, smoothness, and expressiveness. 
A fourth property, which is particularly relevant in the context of this work, is that once 
a set of parameter estimates is available for a density defined on X, the corresponding 
parameters on a sequence of important subspaces are automatically determined. We 
will return to this issue in section 6. 

5.2 Optimal feature transformations 

Unlike the feature representation, which affects only the estimation error, the choice of 
feature transformation has impact in both the Bayes and estimation errors. While the 
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impact on the Bayes error is direct (the Bayes error depends uniquely on the feature 
transformation), the impact on the estimation error is more subtle. It derives from 
the phenomena known as the curse of dimensionality: for a given amount of training 
data, the quality of density estimates degrades as the dimension of the feature space 
increases. The design of an optimal feature transformation must, therefore, account 
for both the Bayes and estimation errors. To understand the associated trade-offs we 
introduce the notion of embedded feature spaces. 

5.2.1 Embedded feature spaces 

Definition 3 Given two vector spaces X m and X n , m < n, such that dim(X m ) = m 
and dim(X n ) = n an embedding is a mapping 

e : X m — > X n (28) 

which is one-to-one. 

A canonical example of embedding is the zero padding operator for Euclidean spaces 

t™ : R m -> R" (29) 
where t JJ,(x) = (x, 0), x € R m , and 0 £ R"~™. 

Definition4 A sequence of vector spaces {Xi, . . . , X4}, such that dim(Xi) < dim(Xi + i), 
is called embedded if there exists a sequence of embeddings 

ti-.Xi-* *Ui> * = 1, 1, (30) 

such that X' i+1 C Xi + i . 

The inverse operation of an embedding is a submersion. 

Definition 5 Given two vector spaces X m and X n , m < n, such that dim(X m ) = m 
and dim(X n ) = n a submersion is a mapping 

1 :X n ^X m (31) 

which is surjective. 

A canonical example of submersion is the projection of Euclidean spaces along the 
coordinate axes 

7T™ : R" -> R™ (32) 

where 7r m (x\ , . . . , x m , £ m _|_i , . . . , x n ) — (x\ , . . . , x m ^. 

The following theorem shows that any linear feature transformation originates a 
sequence of embedded vector spaces with monotonically decreasing Bayes error, and 
monotonically increasing estimation error. 
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Theorem 4 Let 

T : R d -> A" C R d , 
be a linear feature transformation. Then, 

X i = Trf(X),i = l,...,d-l (33) 

is a sequence of embedded feature spaces such that 

L* Xi+1 <L* Xi . (34) 

Furthermore, if~K d = {Xi, . . . , X^} is a sequence or random variables such that 

Xi = 7rf(X),i = l,...,d (35) 
a«J {g(x)}f a sequence of decision functions 

3i(x) = argmaxp Xi jy(x|fc) (36) 
A Si+a ,* i+a > A w ,* ( . (37) 

Proof: see Appendix A. 6. 

It follows that, in general, it is impossible to minimize the Bayes and estimation 
errors simultaneously. On one hand, given a feature space Xi it is usually possible to 
find a subspace where density estimates are more accurate. On the other, the projection 
onto this subspace will increase the Bayes error. The practical result is that there is 
always a need to reach a compromise between the two sources of error. This is illus- 
trated by Figure 6 which shows the typical evolution of the upper and lower bounds on 
the probability of error as one considers successively higher-dimensional subspaces of 
a feature space X. 

Since accurate density estimates can usually be obtained in low-dimensional spaces, 
the two bounds tend to be close when the subspace dimension is small. In this case, the 
probability of error is dominated by the Bayes error. For higher-dimensional subspaces 
the decrease in Bayes error is canceled by an increase in estimation error and the actual 
probability of error increases. Overall, the curve of the probability of error exhibits 
the convex shape depicted in the figure, where an inflection point marks the subspace 
dimension for which Bayes error ceases to be dominant. To achieve optimality, in the 
MPE sense, a retrieval system must therefore operate on the inflection point with the 
smallest probability of error. 

5.2.2 Optimality criteria 

It is straightforward to show (see (52) in the proof of Theorem 1) that a retrieval system 
with class densities P X ]y( x N)> an d decision function (19) has probability of error 



P[g(X) ?Y] = 1- E x [P Yl x(Y = aigmaxpy|xO>)|x)]. (38) 

'' 3 
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Figure 6: Upper bound, lower bound, and probability of error as a function of subspace 
dimension. 



Nevertheless, because this equation depends on the unknown Pxjr ( x |*)> it is impossi- 
ble to minimize the probability of error explicitly. One solution is to assume that the 
estimates p X ]y ( x |«) are good approximations to the true densities, in which case 

P[g(X) ^ Y] a 1 - £x[maxp y ] X (i|x)]. (39) 

In this regime, it follows from the law of large numbers that, given a training sample 
of image observations z = {zi, . . . , zjv}, the optimal feature transformation is 

1 N 

T = argmax — ^maxpy| X [i|4(z*)]. (40) 

k=l 1 



6 Embedded feature representations 

In general, (40) can be a complicated problem since the optimal feature transformation 
depends on the density estimates Py| X [«|^4(zfc)] which, in turn depend on the feature 
space X. Hence, the optimization must resort to an iterative procedure where densities 
are estimated for a given feature space and the feature transformation is then updated 
according to the new estimates. Each of these steps involves cycling through all the im- 
age classes in the database and performing operations (e.g. density estimation) which 
may themselves be non-trivial from a computational standpoint. Since (40) must be 
solved for each subspace dimension, the procedure is too expensive for most applica- 
tions. 
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6.1 Embedded mixture models 

A simpler alternative is to consider only sequences of embedded subspaces of a com- 
mon transformation. The next lemma shows that this can significantly reduce the com- 
plexity of density estimation. 

Lemma 2 Let X be a feature space, {Xj} a sequence of embedded subspaces accord- 
ing to (33), and Xf a sequence of random vectors according to (35). If, under class i, 
X is distributed according to the Gaussian mixture density 

c 

fx|r(x|i) = ^2^i,c6(^,m,c,^i,c) (41) 

c=l 

then, V j € 1, . . . , d, 

c 

P Xj \ Y (x\i) = X>i, c 0[x,n J Vc,n?E i , c (n?) r ] ) (42) 

c=l 

where T\j = [Ij Od-j]y is the projection matrix associated with ttj, Ij the j x j identity 
matrix, and Od-j a matrix of zeros. 

Proof: see appendix A.7. 

The lemma shows that once a set of parameter estimates is obtained for X, the 
sequence of density estimates in the embedded subspaces Xj is automatically known. 
The collection of densities in (42) is denoted by the family of embedded mixture models 
associated with X. Notice that once an estimate is available for {7Tj lC , pi, c , Sj iC } the 
parameters of P Xj |y(x|i) are obtained by simply extracting the first j components 
of the mean vectors fii >c and the upper-left j x j sub-matrix of the covariances Sj lC . 
Hence, it is not necessary to repeat the density estimation for each subspace dimension 
and the overall complexity is really just that of finding the optimal feature transform in 
X. 

In fact, the lemma suggests an efficient cross-validation procedure to find the opti- 
mal subspace dimension of a given transformation T. The basic idea is to select a set 
of query images I = {Ii, .. ., Iq}, establish the associated retrieval ground truth, and 
use this set to infer the optimal subspace dimension. An algorithmic description of this 
procedure is given in Figure 7. It remains to determine how the feature transformation 
T can itself be found. One possibility, that we explore next, is to restrict the search to a 
finite dictionary of transformations that satisfy some properties known to be important 
for visual recognition, e.g. invariance to certain image mappings or plausibility under 
what is known about human perception. 

6.2 Embedded multi-resolution mixture models 

Ever since the work of Hubel and Wiesel [29], it has been established that 1) visual 
processing is local, and 2) different groups in primary visual cortex (i.e. area VI) are 
tuned for detecting different types of stimulus (e.g. bars, edges, and so on). This indi- 
cates that, at the lowest level, the architecture of the human visual system can be well 



6.2 Embedded multi-resolution mixture models 



23 



subspacedim(7, T, {P X |y(x|i), i = 1, . . . , M}) 
• for each query image I s € I: 

- apply the transformation T to a collection of observations from I s to obtain 
a set of query feature vectors x s = {x Si i , . . . , x Si jv} 

- for each subspace dimension j = 1 , . . . , d 

* for each image class i= \,...,M 

■ apply (42) to obtain the embedded mixtures Pxj|y( x |i) 

■ compute 



• average the retrieval measure across queries Rj = 1/|7| J2 S Rs,j- 

• return the subspace dimension j* = argmaxj Rj and associated performance 
score Rj*. 

Figure 7: Algorithm for determining the optimal subspace dimension for a re- 
trieval problem with feature transformation T, and class densities {P X |y(x|i), i = 



approximated by a multi-resolution representation localized in space and frequency, 
and several "biologically plausible" models of early vision are based on this princi- 
ple [60, 42, 4, 21, 67, 5]. 

A space/space-frequency representation is obtained by convolving the image with 
a collection of elementary filters of reduced spatial support and tuned to different spa- 
tial frequencies and orientations. Several elementary filters have been proposed in the 
literature, including differences of Gaussians [42], Gabor functions [55, 21], and dif- 
ferences of offset Gaussians [42], among others. More recently, it has been shown 
that filters remarkably similar to the receptive fields of cells found in VI [50, 2] can 
be learned from training images, by imposing requirements of sparseness [20, 50] or 
independence [2] in the space/space-frequency coefficients. 

When the feature transform T is a multi-resolution decomposition embedded mix- 
ture densities have an interesting interpretation as families of densities defined over 
multiple image scales, each adding higher resolution information to the characteriza- 
tion provided by those before it. In fact, disregarding the dimensions associated with 
high-frequency basis functions is equivalent to modeling densities of low-pass filtered 
images. In the extreme case where only the first, or DC, coefficient is considered the 
representation is equivalent to the histogram of a smoothed version of the original im- 
age. This is illustrated in Figure 8. 




* sort the p* ■ by decreasing value and, based on the resulting order, 
evaluate some measure of retrieval performance (e.g. precision at 
some level of recall) R s j. 



1 



...,M}. 
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Figure 8: An image from the Corel database (top left), its histogram (top right), and 
projections of the corresponding 64-dimensional embedded mixture onto the DC sub- 
space (bottom left), and the subspace of the two lower frequency coefficients (bottom 
right). 

The embedded multi-resolution mixture (EMM) model (embedded mixtures on a 
multi-resolution feature space) can thus be seen as a generalization of the color his- 
togram, where the additional dimensions capture the spatial dependencies that are 
crucial for fine image discrimination (as illustrated in Figure 4). This generalization 
also enables fine control over the invariance properties of the representation. Since the 
histogram is approximately invariant to scaling, rotation, and translation, when only 
the DC subspace is considered the representation is invariant to all these transforma- 
tions. As high-frequency coefficients are included, invariance is gradually sacrificed. 
Of course, invariance can always be improved by including the proper examples in the 
training sample used to learn the parameters of the model. 

6.3 Optimal features 

Given a finite collection T = {T^\ . . . , T^ F '} of multi-resolution transformations, 
the optimal transformation can be found by exhaustive search based on the algorithm 
of Figure 7. In this case, the only non-trivial issue is how to efficiently estimate the 
densities {Px|y(x|i),i = 1, . . ., M} on the different feature spaces. Notice that if 
TW : Z -> X® and : Z -> X^ are two invertible transformations in T, 

then the transformation T^'™) = o (T^)" 1 maps X® into X^ m \ It follows, 

using arguments similar to those of the proof of Lemma 2, that if in X$ the feature 
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optimal transform(7, T) 

1. select a reference transformation in T, e.g. T^\- 

2. for each image class i = 1, . . . , M, use a standard maximum likelihood estima- 
tion technique, e.g. the expectation-maximization algorithm [14], to determine 
the mixture parameters o/P X (i)|y(x|i); 

3. for each transformation f = 2,...,F 

. | er r( 1 'fl = T("o(T( 1 ))- 1 

• compute, for each image class i = \,...,M, the parameters of 
P X (/)|y(x|i) using (45) and (46). 

• let(j* f ,R* f ) = subspacedim(J,T^),{P X (/)|y(x|i),i = 1,...,M}) 

4. let f* = arg max/ P^ awJ j* = jj, ; 

5. return t£P =irf.(T^) 

Figure 9: Algorithm for determining the best feature transformation, and subspace 
dimension for a retrieval problem with transformation dictionary T. 

distribution is, for class i, 

c 

P X (0|y(x|i) = $> ilC 0(x,Ai{ |C ,I3{ |C ) (43) 

c=l 

then, on^W, 

c 

P X (-)|y(x|0 = ^7r i)C e(x,/i™,S™) (44) 

c=l 

where 

^ = T C'™)^ iC (45) 

S™ = T ('.™)53{ iC (T( ,,Tn ^) T . (46) 

Therefore, it suffices to perform density estimation on a reference subspace, e.g. X^\ 
in order to obtain the mixture parameters associated with all transformations in T. The 
search for the optimal feature transformation can thus be performed with the algorithm 
of Figure 9. 

6.4 Multi-resolution feature transforms 

For a feature transformation T : Z — > X one can define an inverse, reconstruction, 
mapping 

A:X->Z. 
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The columns of the associated matrix A are called basis functions of the transforma- 
tion. When A = T T the transformation is orthogonal. Various popular space/space- 
frequency representations are derived from orthogonal feature transforms. 

Definition 6 The Discrete Cosine Transform (DCT) [32] of size n is the orthogonal 
transform whose basis functions are defined by: 

A(i,j) = a(i)a(j) cos ^ ^ cos ^J^ , 0<i,j,x,y<n (47) 

where a = y/l/n for i = 0, and a = \/2jn otherwise. 

The DCT is widely used in image compression, and previous recognition experiments 
have shown that DCT features can lead to recognition rates comparable to those of 
many features proposed in the recognition literature [77]. It is also possible to show 
that, for certain classes of stochastic processes, the DCT converges asymptotically to 
the following transform [32], 

Definition 7 Principal Components Analysis (PCA) is the orthogonal transform de- 
fined by 

T = D _1 / 2 E T , (48) 
where EDE T is the eigenvector decomposition of the covariance matrix E[zz T ]. 

It is well known (and straightforward to show) that PCA generates uncorrected fea- 
tures, i.e. E[xx T ] = I. In this context, PCA is the optimal redundancy reduction 
transform, i.e. the one that produces the most parsimonious description of the input 
observations. For this reason, PCA has been widely used in both compression and 
recognition [74, 48]. 

While they originate spatial/spatial-frequency representations, the major limitation 
of the above transforms as models for visual perception is the arbitrary nature of their 
spatial localization (enforced by arbitrarily segmenting images into blocks). This can 
result in severe scaling mismatches if the block size does not match that of the image 
detail. Such scaling problems are alleviated by the wavelet representation. 

Definition 8 A wavelet transform (WT) [43] is the orthogonal transform whose basis 
functions are defined by 

A(i,j) = VW* (2*z - i) * (2>y - j) lf^: { ^ 2l) (49) 

where ^(x) is a function (wavelet) that integrates to zero. 

Like the DCT, wavelets have been shown empirically to achieve good decorrelation. 
However, natural images exhibit a significant amount of higher-order dependencies that 
cannot be captured by orthogonal components [50]. Eliminating such dependencies is 
the goal of independent component analysis. 
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Definition 9 Independent Component Analysis (ICA) [10] is a feature transform such 
that 

Px(x) = J]P Xi (x«) (50) 

i 

where X = (Xi , . . . , X4) is the random process from which feature vectors are drawn. 

The exact details of ICA depend on the particular algorithm used to learn the basis 
from a training sample. Since independence is usually difficult to measure and enforce 
if d is large, ICA techniques tend to settle for less ambitious goals. The most popular 
solution is to minimize a contrast function which is guaranteed to be zero if the inputs 
are independent. Examples of such contrast functions are higher order correlations and 
information-theoretic objective functions[10]. In this work, we consider representa- 
tives from the two types: the method developed by Comon [12], which uses a contrast 
function based on high-order cumulants, and the FastICA algorithm [30], that relies on 
the negative entropy of the features. 

7 Experimental evaluation 

In this section, we present an experimental evaluation of DTR and a comparison against 
various retrieval techniques in common use. In the retrieval context, it is desirable to 
rely on a generic representation that can achieve equally good performance for di- 
verse types of imagery. For this reason, we conducted experiments on three different 
databases: the Brodatz texture database, the Columbia object database, and a subset of 
the Corel database of stock photography. While Brodatz provides a good testing ground 
for texture retrieval, color-based methods tend to do well on Columbia. Corel contains 
generic imagery and requires retrieval algorithms that can account for both color and 
texture. In each case, we surveyed the literature to identify a competing technique that 
is representative of the state of the art in each area. 

7.1 Databases and performance evaluation 

The 1008 images in the Brodatz database were divided into two subgroups: a query 
database of 112, and a retrieval database of 896 images. Various previous stud- 
ies have identified the combination of 1) the coefficients of the least squares fit of a 
multi-resolution simultaneous auto-regressive (MRSAR) model to each texture and 2) 
the Mahalanobis distance, as a top performer in this database [54, 40, 41]. Follow- 
ing [45, 40, 44], the MRSAR features were computed using a window of size 21 x 21 
sliding over the image with increments of two pixels in both the horizontal and vertical 
dimensions. Each feature vector consists of 4 SAR parameters plus the error of the fit 
achieved by the SAR model at three resolutions, in a total of 15 dimensions. 

The Columbia database was also split into two subsets: a query database containing 
a single view of each of the 100 objects available, and a retrieval database containing 
9 views (separated by 40°) of each object. It was chosen because it is a database 
where histogram-based methods tend to perform well, allowing a comparison of DTR 
against these techniques. For color histogramming, the 3D color space was quantized 
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by finding the bounding box for all the points in the query and retrieval databases and 
then dividing each axis in b bins. This leads to b 3 cells. Experiments were performed 
with different values of b. Retrieval was based on HI. 

From Corel we selected 15 image classes 1 leading to a total of 1, 500 images. Of 
these, 20% were used on the query database, leaving the remaining 80% for retrieval. 
In addition to the texture and color-based approaches, retrieval performance was com- 
pared against those of two other approaches that are representative of the state of the 
art in the joint modeling of the two attributes: the color correlogram [28] and the linear 
combination of color and texture distances. To combine color and texture distances 
linearly we started by evaluating all the distances between query and database entries 
according to both HI and MRSAR/MD. For each query, we then normalized all dis- 
tances by their mean and variance, clipped all values with magnitude larger than three 
standard deviations, and mapped the resulting interval into [0, 1]. An overall distance 
was then computed for each entry in the database, according to 

d' = (1 - w) x d c + w x d t 

where d c and dt are, respectively, the normalized distances according to HI and MR- 
SAR/MD, and w a pre-defined weight. 

On Corel and Columbia images were converted from the original RGB to the YBR 
color space. The feature space had 64 dimensions per color channel, and features from 
the different channels were interleaved according to the pattern YBRYBR... Mixtures 
of 8 Gaussians were used for the Brodatz and Corel databases and 16 for Columbia. 
Diagonal covariances were used for all Gaussians, and all the mixture parameters were 
learned with the EM algorithm [14]. Each image in the database was considered as an 
independent class. A series of experiments were designed with the goal of evaluating 
the performance of the individual components of the DTR architecture. 

7.2 Similarity function 

The first series of experiments was designed to compare the performance of the Bayesian 
retrieval criteria with that of the, more commonly used, MD or HI similarity functions. 
In order to isolate the contribution of the similarity function from those of the features 
and the feature representation, the comparison was performed with the feature sets and 
representations discussed above for the Brodatz and Columbia databases: color-based 
retrieval was implemented by combining the color histogram with (8) and texture-based 
retrieval by the combination of the MRSAR features with (15). 

Figure 10 presents precision/recall (PR) curves for the two databases. As expected, 
texture-based retrieval (MRSAR) performs better on Brodatz while color-based re- 
trieval (color histogramming) does better on Columbia. However, when the appro- 
priate features and representation are used for the specific database, ML always leads 
to a clear improvement in retrieval precision. In particular, for the texture database, 
combining ML with the MRSAR features and the Gaussian representation leads to an 

'"Arabian horses", "auto racing", "coasts", "divers and diving", "English country gardens", "fireworks", 
"glaciers and mountains", "Mayan and Aztec ruins", "oil paintings", "owls", "land of the pyramids", "roses", 
"ski scenes", "religious stained glass". 
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Figure 10: PR curves for Brodatz (top) and Columbia (bottom). In the legend, MR- 
SAR means MRSAR features, H color histograms, ML maximum likelihood, MD Ma- 
halanobis distance, and I intersection. The total number of bins in each histogram is 
indicated after the H. 
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improvement in precision from 5 to 10% (depending on the level of recall) over that 
achievable with the MD. Similarly, on Columbia, replacing HI by the ML criteria leads 
to an improvement that can be as high as 20%. 

7.3 Feature transformation 

The next set of experiments was designed to assess the retrieval performance of the 
various multi-resolution transformations discussed in section 6.4. For this, we mea- 
sured precision at various levels of recall, on Brodatz and Corel. Since Corel is a larger 
database and contains colored images, 192-dimensional feature space, the queries take 
longer to compute. For this reason, we restricted the analysis to the first 64 dimensions 
(and only considered one of the ICA techniques). In both databases, the relative pre- 
cision of the various transformations was similar at all levels of recall. We, therefore, 
only present curves of precision, as a function of subspace dimension, at 30% recall on 
Brodatz and 10% recall on Corel. These are shown in Figure 11. 

Notice that the precision curves comply with the theoretical arguments of sec- 
tion 5.2.1. Since precision is inversely proportional to the probability of error one 
would expect, from those arguments, the precision curves to be concave. This is in- 
deed the case (there is a large increase in precision from 1 to 8 dimensions that we do 
not show for clarity of the graph) for all transformations. 

In terms of the relative performance of the different transforms, the DCT is the 
top performer for both databases reaching high precision in both cases. On the other 
hand, PCA always performs poorly. This is a significant result, given that PCA has 
been widely used in visual recognition [74, 48]. The performance of the other features 
seems to be significantly more dependent on the database. Wavelets do quite well on 
Corel, but very poorly on Brodatz, ICA does better on Brodatz than on Corel. 

It is important to emphasize that, for a given database, a poor choice of transforma- 
tion can lead to significant degradation of retrieval performance. On Brodatz the peak 
precision of the worst transformation (wavelet) is 10% below that of the best (DCT), 
on Corel the variation is almost 20%. Even for a given transformation, precision can 
vary dramatically with the number of embedded subspaces. For example, the precision 
of the DCT features on Brodatz drops from the peak value of about 92% to about 62% 
when all the subspaces are included. These observations emphasize the importance of 
the feature selection algorithm discussed in section 6.3. 

7.4 Comparison to standard solutions 

Since the DCT features achieved the best performance across databases, they were the 
only features considered in the remaining experiments. In this section, we present a 
comparison of the performance of EMM/ML/DCT combination against those of MR- 
SAR/ML and HI, in the specific databases where the latter work best: texture (Brodatz) 
for MRSAR and color (Columbia) for HI. Figure 12 presents the resulting PR curves, 
showing that EMM/ML/DCT achieves equivalent performance or actually outperforms 
the best of the two other approaches in each image domain. This demonstrates that, de- 
spite its generic nature, the EMM/ML/DCT representation can handle both color and 
texture and should therefore do well on a large spectrum of databases. 
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Figure 1 1 : Top: Precision, at 30% recall, on Brodatz. Bottom: Precision, at 30% recall, 
on Corel. 
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Figure 12: PR curves of EMM/ML/DCT, MRSAR/ML, and HI on Brodatz (top) and 
Columbia (bottom). 
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Visual inspection of the retrieval results suggests that, also along the dimension of 
perceptual relevance, EMM/ML/DCT clearly beats the MRSAR and histogram-based 
approaches. Figure 13, presents representative examples of the three major advantages 
of EMM/ML 2 : 1) when it makes errors, these tend to be perceptually less annoying 
than those of the other approaches, 2) when there are several visually similar classes in 
the database, images from these classes tend to be retrieved together, and 3) even when 
the performance is worse than that of the other approaches in terms of PR, the results 
are frequently better from a perceptual standpoint. 
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Fi gure 1 3 '. Comparison of EMM/ML/DCT retrieval results (left column) with those of HI on Columbia 
and MRSAR/MD on Brodatz (right column). The number above each image indicates the class to which it 
belongs. 

The two pictures on the top row exemplify how EMM/ML/DCT can lead to per- 
ceptually pleasing retrieval results, even when the PR performance is only mediocre. 

2 A significantly larger set of retrieval examples is available from 
http : / /crl . research . Compaq . com/ vis ion/multimedia/ retrieval /default . htm, 
and a more detailed analysis presented in [77]. 
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In this case, while HI retrieves several objects unrelated to the query, EMM/ML/DCT 
only returns objects that, like the query, are made of wood blocks. This is due to the fact 
that, by relying on features with spatial support, EMM/ML/DCT is able to capture the 
local appearance of the object surface. It will thus tend to match surfaces with the same 
shape, texture, and reflection properties. This is not possible with color histograms. 

The two images on the center exemplify situations where both approaches perform 
perfectly in terms of PR, yet the perceptual retrieval quality is very different. MR- 
SAR/ML ranks all the images in the query class at the top, but produces non-sense 
matches after that. On the other hand, EMM/ML/DCT retrieves images that are visu- 
ally similar to the query after all the images in its class are exhausted. This observation 
is frequent and derives from the fact that the MRSAR features have no perceptual jus- 
tification. 

Finally, the pictures on the bottom illustrate how, even when it has higher PR, HI 
frequently leads to perceptually poorer results than EMM/ML/DCT. In this case, im- 
ages of a pear and a duck are retrieved by HI after the images in the right class ('Advil 
box"), even though there are several boxes with colors similar to those of the query in 
the database. On the other hand, EMM/ML/DCT only retrieves boxes, although not in 
the best possible order. 

7.5 Generic retrieval 

We finalize with results from Corel. The top plot on Figure 14 presents a comparison, in 
terms of PR, of MRS AR/MD, HI, the color correlogram, and EMM/ML/DCT. It is clear 
that the texture model alone performs very poorly, color histogramming does signifi- 
cantly better, and the correlogram further improves performance by about 5%. How- 
ever, all these approaches are significantly less effective than EMM/ML/DCT. Simi- 
larly, the bottom plot compares the PR curves of EMM/ML/DCT with those obtained 
by linear weighting of the color and texture distances. Several curves are shown for val- 
ues of w € [0, 1]. It is clear that linear weighting is never better than EMM/ML/DCT. 
Given that, in a realistic retrieval scenario, the value of the optimal weight is not known, 
there are no simple ways to determine it, and the linear combination always requires 
an increase in complexity (distances have to be computed according to the two repre- 
sentations), we see no reason to prefer this type of solution to DTR. 

We conclude with three retrieval examples, shown in Figure 15, from the Corel 
database. These examples illustrate the robustness of DTR to changes in scene layout, 
object position and orientation, or variability in the background. A significantly larger 
set of retrieval examples is available from 

http : / /crl . research. Compaq. com/ vision/multimedia/ retrieval /default .htm. 
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Figure 14: Top: PR on Corel for MRSAR/MD, HI, color correlogram (CAC), and EMM/ML/DCT. Bot- 
tom: Comparison of PR achieved with EMM/ML/DCT and linear weighting of MRSAR/MD and HI for 
different weights. 
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A Proofs of the theoretical results 

For any two integers i and j, the Kronecker delta function is defined by 



A.l Proof of theorem 1 

Proof: The proof can be found in various textbooks (see [15, 22] among many others). 
We include it here because some of the intermediate results are required by subsequent 




(51) 



proofs. 



The probability of error associated with the decision rule g(x) is 



Px,y(5(X)^Y) = / Py| X (y^5(x)|x)P x (x)dx 



= £ x [P y | X (iVs(x)|x)], 



(52) 



where 



P y ,x(y^5(x)|x) = ^P(y^ 5 (X)|X=x,y = i)P y | X (i|x) 



X^ 1 - 5 s(x),i)fV|x(i|x) 




(53) 



and 6ij is the Kronecker delta function defined in (51). It follows that 



Py|x(y ^S(x)|x) > l-maxiV|x(*|x) 



= l-P y | X (F = 9*(x)|x) 



= Py|x(y^(x)|x) 



and, consequently 



£x[Py|xCr ^ 5 (x)|x)] > P x [P F | X (y ? S*(x)|x)]. 
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I.e., any other decision rule will have a larger probability of error than the Bayes clas- 
sifier. Since, from (52), 



the probability of error can never be smaller than the Bayes error. 

A.2 Proof of theorem 2 

Proof: The following proof is a straightforward extension to multiple classes of the one 
given in [15] for the two-class problem. From (5), 



L* x = 1 - £ x [max P y I x (i | x)], 

i 

= l-£ x [maxP y | X (i|T(z))], 

= 1 - £x[max J P y | Z ,x(i|z ! T(z))P Z |x(z|T(z))dz] ! 

= 1 - P x [max J Py| Z (i|z)P Z | X (z|T(z))dz] ! 

= 1 - P x [maxP z | X [Py|z(i|z)|X = T(z)]], 

> 1 - P x [P z | X [ma X Py|z(i|z)|X = T(z)]], 

= 1 -P z [maxPyi z (i|z)] =L* Z , 

i 

where equality is achieved if and only if T is an invertible map, and we have used the 
fact that 

E a \ x [maxPy\ z {i\z)\X = T(z)] > E a \ x [Py\ z (i\z)\X = T(z)], Vi. 
A.3 Proof of theorem 3 

Proof: From (52), 



P x , Y (g*(X)?Y) = l-£ x [P y | X (r = 5*(x)|x)] 



= 1 - P x [maxPy| X («|x)] = L* 



P X}Y (g(X)^Y)-L x = 




(54) 
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and since, Vx G X such that g(x) = g* (x), we have 

P y , x (y f g(x)|x) = p y , x (y ^ 5 *(x)|x), 

this is equivalent to 

P X}Y (g(X)j=Y)-L* x = 

[ [Py\x(Y ± 5 (x)|x) - P y , x (Y f 9 *(x)|x)]P x (x)d S 

where 

P = {x|xG ^Px(x) >0, S (x) ^g*(x)}. 

Letting 

A(x) = P yjx (r ^ 5 (x)|x) - P yjx (F ^ 9 *(x)|x) 
and defining the sets 

p; = {x|xeP, 5 *(x) = »} 
Pi = {x|xeP,5(x)=i}, 

it follows from (53) that, Vx G P* n E h 

A(x)=P y | X (i|x)-Py|x(j|x). 

Since, from (4), 

iV|x(*|x) - Py|x(j|x) > 0 Vx € E*, Vj ^ i, 
from (19) and the fact that P x (x) > 0 Vx G P, 

Px|y(x|j)py(j) Px|y(x|i)py(i) 



P x (x) P x (x) 

defining 



> 0 Vx G P j; Vi ^ j, 



,., , Px|y(x|i)py(i) 
W( * |X) = ^FxTx) ' 

we have, Vx G P* fl Pj, 

A(x) = P y | X (i|x)-Py|x(j|x) 

< Py|x(i|x) -Py|x(j|x) +py|x(i|x) -pyj X (*|x) 
= |^y|x(«|x) - Py|x(j|x) +py|x(i|x) -py| X («|x)| 

< |fy|x(i|x) - py|x(*|x)| + |Py|xO'|x) - Py|xO'|x) 
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and 



f A(x)P x (x)dx < f \P x] y(x\i)P Y (i) - Px\Y(x\i)p Y (i)\d> 

J E*C\Ej JE?nEj 



+ / |P X |y(x|j)iV0') -Px]y(x|j)py(j)|dx. 

JETnEj 

Using the fact that both collections of sets E* and Ej partition E, 
f A(x)P x (x)dx = Y, [ A(x)P x (x)dx 

JE i j JETHEj 

< / l-Px|y(x|i)Py(i) -p X |y(x|i)py(i)|dx + 
/ l^x|y(x|j)Py(j) -p X |y(x|j)py(j)|dx 

= Y] / |Px|y(x|i)Py(i) -p X |y(x|i)py(i)|(ix 
+ / |Px|y(x|i)Py(i) -px|y(x|i)py(i)|dx 

J Ei 

< ^2 l p x|y(x|i)Py(i) -p X ]y(x|i)py(i)|dx 

i J 

where we have also used the fact that E* n Ei = 0. 
A.4 Proof of Corollary 1 

Proof: This is a straightforward consequence of applying Pinsker's inequality [13] 

y |P x (x) - Qx(x)|dx < KL[P x (x)||Q x (x)] (55) 
to (20) under the assumption of equiprobable classes. 

A.5 Proof of Lemma 1 

Proof: From the well known fact that, for any densities P and Q, KL[P\\Q] > 0 [13] 
it follows that A 3t x is minimum if and only if, for all i 

Pi = argminKL[Px|y(x|i)||P X |y(x|i;p)] (56) 
p 



A. 6 Proof of Theorem 4 
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= argmax j P X |y(x|i) log P X |y (x|i; p)dx (57) 

= argmaxP Xi [logP x i y (x|i;p)]. (58) 
p 

The lemma follows by application of the law of large numbers. 

A.6 Proof of Theorem 4 

Proof: The fact that the sequence of vector spaces is embedded follows from (33) since, 

Vie {l,...,d-l} 

Xi = nt +1 (X i+1 ) (59) 

and, consequently, 

4 +1 (^)c^ i+1 . (60) 

Inequality (34) then follows from (59), (18) and the fact that the mappings 7r] +1 (x) are 
non-invertible. To prove (37) we start from Corollary 1, i.e. 

A guXi =Y i KL[P Xi]Y (x\k)\\p XilY (x\k)}, (61) 

k 

where P x .| y (x|fc) is the class-conditional likelihood function for Xj under class k. 
Since, from (59), Xj+i = (Xj, Xj+i) where Xi + i is the i + 1 th coordinate of Xj+i 

^L[P Xi + 1 | y (x|fc)||p Xi+1 |y(x|fc)] = 



/ 



fx, tl |H»Wios Fx '""'' X | l !' i» 

+ 1 Px i+1 |y(x|fc) 



= / P Xi+1 |y(x|fc)log- dx 

■/ PXi+iiXi.yi^+ikr (x),fc) 

+ / P Xi+1 |y(x|fe)log : dx 
J Px. iyK (x) fc) 



/ 



Px i+1 iy(x|fe) 
'px i+1 |x i ,y(x i+1 |7ri +1 (x),ft)P Xi , y (7r| +1 (x)|fc)' 



P Xi+1 ,y(x|fc) log ■ — ^ — dx 



+ / P Xj |y(x fc)log- — ? . rix 
7 1 Px ; |y(x|fc) 

= ^L[P Xi+l]y (x|fc)||p Xi+1 | Xi , y (x i+1 |7ri +1 (x),fc)P Xi | y (7r] +1 (x)|fc)] 
+ JCL[P Xi , y (x|fc)||p Xi | y (x|fc)] 
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> KL[Px ilY (x\k)\\px ilY (x\k)} 

where we have used the non-negativity of the KL divergence [13]. Combining with (61) 
leads to (37). 

A.7 Proof of lemma 2 

Proof: Consider a Gaussian random vector X € M d such that P X |y(x|i) = G(x,p, x , S x ), 
and define Xj = 7r^(X). Since ttj is a linear transformation, Xj is also Gaussian dis- 
tributed. Therefore, Pxj (x) is uniquely determined by its mean and covariance. Using 
the well known relationships 

Ai xj = n^ X! and s xj = n*n x (nf) T , 

it follows that 

P Xj |y(x|i) = e(x,n^ x ,n^ x (n^) T ). 

Applying this result to each of the C components of (41) we obtain (42). 
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